Results 1 - 20 of 90
1.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38609331

ABSTRACT

Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.


Subjects
Drug Discovery , Natural Language Processing , Signal Transduction
2.
Database (Oxford) ; 2022, 2022 07 15.
Article in English | MEDLINE | ID: mdl-35849027

ABSTRACT

In this research, we explored various state-of-the-art biomedical-specific pre-trained Bidirectional Encoder Representations from Transformers (BERT) models for the National Library of Medicine - Chemistry (NLM-CHEM) and LitCovid tracks in the BioCreative VII Challenge, and propose a BERT-based ensemble learning approach that integrates the advantages of various models to improve the system's performance. The experimental results of the NLM-CHEM track demonstrate that our method can achieve remarkable performance, with F1-scores of 85% and 91.8% in strict and approximate evaluations, respectively. Moreover, the proposed Medical Subject Headings identifier (MeSH ID) normalization algorithm is effective in entity normalization, achieving an F1-score of about 80% in both strict and approximate evaluations. For the LitCovid track, the proposed method is also effective in detecting topics in the Coronavirus disease 2019 (COVID-19) literature, outperforming the compared methods and achieving state-of-the-art performance on the LitCovid corpus. Database URL: https://www.ncbi.nlm.nih.gov/research/coronavirus/.


Subjects
COVID-19 , Data Mining , Data Mining/methods , Humans , Machine Learning , Medical Subject Headings , PubMed
3.
BMC Bioinformatics ; 22(1): 389, 2021 Jul 30.
Article in English | MEDLINE | ID: mdl-34330209

ABSTRACT

BACKGROUND: Antimicrobial peptides (AMPs) are oligopeptides that act as crucial components of innate immunity, occur naturally in all multicellular organisms, and form part of the first line of defense. Recent studies showed that AMPs have great potential that is not limited to antimicrobial activity. They are also crucial regulators of host immune responses that can modulate a wide range of activities, such as immune regulation, wound healing, and apoptosis. However, the ability of microorganisms to adapt to and resist existing antibiotics has prompted the scientific community to develop alternatives to conventional antibiotics. Therefore, to address this issue, we proposed Co-AMPpred, an in silico-aided AMP prediction method based on compositional features of amino acid residues to classify AMPs and non-AMPs. RESULTS: In our study, we developed a prediction method that incorporates composition-based sequence and physicochemical features into various machine-learning algorithms. Then, the Boruta feature-selection algorithm was used to identify discriminative biological features, and only these features were used to develop our model. Additionally, we performed stratified tenfold cross-validation to validate the predictive performance of our AMP prediction model and evaluated it on an independent holdout test dataset. A benchmark dataset was collected from previous studies to evaluate the predictive performance of our model. CONCLUSIONS: Experimental results show that combining composition-based and physicochemical features outperformed existing methods on both the benchmark training dataset and a reduced training dataset. Finally, our proposed method achieved an accuracy of 80.8% and an area under the receiver operating characteristic curve of 0.871 on the independent test set. Our code and datasets are available at https://github.com/onkarS23/CoAMPpred .


Subjects
Algorithms , Machine Learning , Computer Simulation , Pore Forming Cytotoxic Proteins , ROC Curve
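
The Co-AMPpred entry above relies on composition-based sequence features. As a minimal sketch of what one such feature can look like, the following computes plain amino acid composition (the fraction of each standard residue in a peptide). The function name and the example peptide are hypothetical illustrations, not taken from the Co-AMPpred code, and the published method may compute its features differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> dict[str, float]:
    """Return the fraction of each standard residue in a peptide sequence.

    Hypothetical helper for illustration; non-standard characters are ignored.
    """
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Made-up peptide string, used only to show the call.
composition = amino_acid_composition("GIGKFLKKAKKFGKAF")
```

The resulting 20-value vector is the kind of input that could then be passed to a feature-selection step and a classifier.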
4.
Biology (Basel) ; 11(1), 2021 Dec 21.
Article in English | MEDLINE | ID: mdl-35053004

ABSTRACT

Interleukin (IL)-10 is a homodimeric cytokine that plays a crucial role in suppressing inflammatory responses and regulating the growth or differentiation of various immune cells. However, the molecular mechanism of IL-10 regulation is only partially understood because its regulation is environment- or cell type-specific. In this study, we developed a computational approach, ILeukin10Pred (interleukin-10 prediction), by employing amino acid sequence-based features to predict and identify potential immunosuppressive IL-10-inducing peptides. The dataset comprises 394 experimentally validated IL-10-inducing and 848 non-inducing peptides. We split the dataset into a training set (80%) and a test set (20%). To train and validate the model, we applied stratified five-fold cross-validation. The final model was then evaluated using the holdout set. An extra tree classifier (ETC)-based model achieved an accuracy of 87.5% and a Matthews correlation coefficient (MCC) of 0.755 on the hybrid feature types. It outperformed an existing state-of-the-art method based on dipeptide compositions that achieved an accuracy of 81.24% and an MCC value of 0.59. Our experimental results showed that the combination of various features achieved better predictive performance.
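
The ILeukin10Pred entry above reports the Matthews correlation coefficient (MCC) alongside accuracy, which matters for an imbalanced dataset such as 394 positives versus 848 negatives. As a sketch of how MCC follows from a binary confusion matrix, the function below applies the standard formula. The function name is a hypothetical illustration; returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect), and unlike accuracy it does not reward a model for simply predicting the majority class.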

6.
Bioinformatics ; 37(3): 404-412, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32810217

ABSTRACT

MOTIVATION: Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities, known as the Bio-Entity Relation Extraction (BRE) task, has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, LBERT, a Lexically aware Transformer-based Bidirectional Encoder Representation model that explores both local and global context representations for sentence-level classification tasks. RESULTS: This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein-protein interaction (PPI), drug-drug interaction and protein-bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. AVAILABILITY AND IMPLEMENTATION: GitHub: https://github.com/warikoone/LBERT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Knowledge Bases , Natural Language Processing , Language , Research Design , Semantics
7.
Brief Bioinform ; 21(6): 2219-2238, 2020 12 01.
Article in English | MEDLINE | ID: mdl-32602538

ABSTRACT

Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein-protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonly used BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein-protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP_EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681; Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.


Subjects
Data Mining , Information Storage and Retrieval , Natural Language Processing , Benchmarking , Computational Biology/methods , Data Mining/methods , Databases, Factual , Neural Networks, Computer , PubMed , Software , Surveys and Questionnaires
8.
BMC Genomics ; 21(1): 182, 2020 Feb 24.
Article in English | MEDLINE | ID: mdl-32093618

ABSTRACT

BACKGROUND: Personal genomics and comparative genomics are becoming more important in clinical practice and genome research. Both fields require sequence alignment to discover sequence conservation and variation. Though many methods have been developed, some are designed for small genome comparison while others are not efficient for large genome comparison. Moreover, most existing genome comparison tools have not had the correctness of their sequence alignments evaluated systematically. A wrong sequence alignment would produce false sequence variants. RESULTS: In this study, we present GSAlign, which handles large genome sequence alignment efficiently and identifies sequence variants from the alignment result. GSAlign is an efficient sequence alignment tool for intra-species genomes. We estimate performance by measuring the correctness of predicted sequence variations. The experimental results demonstrate that GSAlign is not only faster than most existing state-of-the-art methods, but also identifies sequence variants with high accuracy. CONCLUSIONS: As more genome sequences become available, the demand for genome comparison is increasing. Therefore, an efficient and robust algorithm is most desirable. We believe GSAlign can be a useful tool. It offers ultra-fast alignment as well as high accuracy and sensitivity for detecting sequence variations.


Subjects
Genome , Genomics/methods , Sequence Alignment/methods , Software , Algorithms , Sequence Analysis, DNA
9.
Nucleic Acids Res ; 48(D1): D148-D154, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31647101

ABSTRACT

MicroRNAs (miRNAs) are small non-coding RNAs (typically consisting of 18-25 nucleotides) that negatively control expression of target genes at the post-transcriptional level. Owing to the biological significance of miRNAs, miRTarBase was developed to provide comprehensive information on experimentally validated miRNA-target interactions (MTIs). To date, the database has accumulated >13,404 validated MTIs from 11,021 articles from manual curations. In this update, a text-mining system was incorporated to enhance the recognition of MTI-related articles by adopting a scoring system. In addition, a variety of biological databases were integrated to provide information on the regulatory network of miRNAs and its expression in blood. Not only targets of miRNAs but also regulators of miRNAs are provided to users for investigating the up- and downstream regulations of miRNAs. Moreover, the number of MTIs with high-throughput experimental evidence increased remarkably (validated by CLIP-seq technology). In conclusion, these improvements promote the miRTarBase as one of the most comprehensively annotated and experimentally validated miRNA-target interaction databases. The updated version of miRTarBase is now available at http://miRTarBase.cuhk.edu.cn/.


Subjects
Databases, Nucleic Acid , MicroRNAs/metabolism , Circulating MicroRNA/metabolism , Data Mining , Gene Expression Regulation , RNA, Messenger/metabolism , User-Computer Interface
10.
Biochemistry ; 59(34): 3078-3088, 2020 09 01.
Article in English | MEDLINE | ID: mdl-31454239

ABSTRACT

Carbohydrates make up one of the four major classes of biomolecules, often conjugated with proteins as glycoproteins or with lipids as glycolipids, and participate in many important biochemical functions in living species. However, glycoproteins or glycolipids often exist as mixtures, and as a consequence, it is difficult to isolate individual glycoproteins or glycolipids in pure form to understand the role carbohydrates play in the glycoconjugate. Currently, the only feasible way to obtain pure glycoconjugates is through synthesis, and of the many methods developed for the synthesis of oligosaccharides, those with automatic and programmable potential are considered to be more effective for addressing the issues of carbohydrate diversity and related functions. In this Perspective, we describe how data science, including algorithms and machine learning, can be used to assist the chemical synthesis of oligosaccharides in a programmable and one-pot manner and how the programmable method can be used to accelerate the construction of diverse oligosaccharides to facilitate our understanding of glycosylation in biology.


Subjects
Chemistry Techniques, Synthetic/methods , Oligosaccharides/chemical synthesis , Machine Learning , Oligosaccharides/chemistry
11.
Sci Rep ; 9(1): 15975, 2019 11 04.
Article in English | MEDLINE | ID: mdl-31685900

ABSTRACT

N-linked glycosylation is one of the predominant post-translational modifications involved in a number of biological functions. Since experimental characterization of glycosites is challenging, glycosite prediction is crucial. Several predictors have been made available and report high performance. Most of them evaluate their performance at every asparagine in protein sequences, not confined to asparagine in the N-X-S/T sequon. In this paper, we present N-GlyDE, a two-stage prediction tool trained on rigorously-constructed non-redundant datasets to predict N-linked glycosites in the human proteome. The first stage uses a protein similarity voting algorithm trained on both glycoproteins and non-glycoproteins to predict a score for a protein to improve glycosite prediction. The second stage uses a support vector machine to predict N-linked glycosites by utilizing features of gapped dipeptides, pattern-based predicted surface accessibility, and predicted secondary structure. N-GlyDE's final predictions are derived from a weight adjustment of the second-stage prediction results based on the first-stage prediction score. Evaluated on N-X-S/T sequons of an independent dataset comprised of 53 glycoproteins and 33 non-glycoproteins, N-GlyDE achieves an accuracy and MCC of 0.740 and 0.499, respectively, outperforming the compared tools. The N-GlyDE web server is available at http://bioapp.iis.sinica.edu.tw/N-GlyDE/ .
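
The N-GlyDE entry above restricts evaluation to asparagines inside the N-X-S/T sequon. As a minimal sketch of how such candidate sites can be enumerated, the following uses a regular expression with a lookahead so that overlapping sequons are all reported. The function name is hypothetical, and excluding proline at position X is the conventional sequon definition assumed here, since the abstract does not spell it out. This only lists candidate positions and is not the N-GlyDE predictor.

```python
import re

# N followed by any residue except proline (assumed convention), then S or T.
# The lookahead keeps the match one character wide, so overlapping sequons
# such as "NNTS" are each found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein: str) -> list[int]:
    """Return 1-based positions of asparagines in N-X-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


# Made-up sequence, used only to show the call.
positions = find_sequons("MNGTANPSNNTS")
```

In the made-up sequence, the asparagine followed by proline is skipped, while the two adjacent asparagines near the end are both candidates.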

12.
J Vis Exp ; (151), 2019 09 06.
Article in English | MEDLINE | ID: mdl-31545316

ABSTRACT

This article presents a general experimental protocol for programmable one-pot oligosaccharide synthesis and demonstrates how to use Auto-CHO software for generating potential synthetic solutions. The programmable one-pot oligosaccharide synthesis approach is designed to empower fast oligosaccharide synthesis of large amounts using thioglycoside building blocks (BBLs) with the appropriate sequential order of relative reactivity values (RRVs). Auto-CHO is a cross-platform software with a graphical user interface that provides possible synthetic solutions for programmable one-pot oligosaccharide synthesis by searching a BBL library (containing about 150 validated and >50,000 virtual BBLs) with accurately predicted RRVs by support vector regression. The algorithm for hierarchical one-pot synthesis has been implemented in Auto-CHO and uses fragments generated by one-pot reactions as new BBLs. In addition, Auto-CHO allows users to give feedback for virtual BBLs to keep valuable ones for further use. One-pot synthesis of stage-specific embryonic antigen 4 (SSEA-4), which is a pluripotent human embryonic stem cell marker, is demonstrated in this work.


Subjects
Oligosaccharides/chemical synthesis , Stage-Specific Embryonic Antigens/chemistry , Glycosylation , Humans , Software
13.
J Am Med Inform Assoc ; 26(11): 1227-1236, 2019 11 01.
Article in English | MEDLINE | ID: mdl-31390470

ABSTRACT

OBJECTIVE: In this era of digitized health records, there has been a marked interest in using de-identified patient records for conducting various health-related surveys. To assist in this research effort, we developed a novel clinical data representation model entitled medical knowledge-infused convolutional neural network (MKCNN), which is used for learning the clinical trial criteria eligibility status of patients to participate in cohort studies. MATERIALS AND METHODS: In this study, we propose a clinical text representation infused with medical knowledge (MK). First, we isolate the noise from the relevant data using a medically relevant description extractor; then we utilize log-likelihood ratio-based weights from selected sentences to highlight "met" and "not-met" knowledge-infused representations in a bichannel setting for each instance. The combined medical knowledge-infused representation (MK) from these modules helps identify significant clinical criteria semantics, which in turn renders effective learning when used with a convolutional neural network architecture. RESULTS: MKCNN outperforms other MK-relevant learning architectures by approximately 3%, notably the SVM and XGBoost implementations developed in this study. MKCNN scored 86.1% on the F1 metric, a gain of 6% above the average performance assessed from the submissions for the n2c2 task. Although pattern/rule-based methods show a higher average performance for the n2c2 clinical data set, MKCNN significantly improves the performance of machine learning implementations for clinical datasets. CONCLUSION: MKCNN scored 86.1% on the F1 score metric. In contrast to many of the rule-based systems introduced during the n2c2 challenge workshop, our system presents a model that draws heavily on machine-based learning. In addition, the MK representations add more value to clinical comprehension and interpretation of natural texts.


Subjects
Clinical Trials as Topic/methods , Data Mining/methods , Deep Learning , Neural Networks, Computer , Patient Selection , Humans , Natural Language Processing , Support Vector Machine
14.
Anal Chem ; 91(15): 9403-9406, 2019 08 06.
Article in English | MEDLINE | ID: mdl-31305071

ABSTRACT

Protein and peptide identification and quantitation are essential tasks in proteomics research and involve a series of steps in analyzing mass spectrometry data. Trans-Proteomic Pipeline (TPP) provides a wide range of useful tools through its web interfaces for analyses such as sequence database search, statistical validation, and quantitation. To utilize the powerful functionality of TPP without the need for manual intervention to launch each step, we developed a software tool, called WinProphet, to create and automatically execute a pipeline for proteomic analyses. It seamlessly integrates with TPP and other external command-line programs, supporting various functionalities, including database search for protein and peptide identification, spectral library construction and search, data-independent acquisition (DIA) data analysis, and isobaric labeling and label-free quantitation. WinProphet is a standalone, installation-free tool with graphical interfaces for users to configure, manage, and automatically execute pipelines. The constructed pipelines can be exported as XML files with all of the parameter settings for reusability and portability. The executable files, user manual, and sample data sets of WinProphet are freely available at http://ms.iis.sinica.edu.tw/COmics/Software_WinProphet.html .


Subjects
Data Analysis , Proteomics/methods , Software , User-Computer Interface , Workflow
15.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-30809637

ABSTRACT

The detection of microRNA (miRNA) mentions in scientific literature helps researchers find relevant and appropriate literature based on queries formulated using miRNA information. Considering that most published biological studies elaborate on signal transduction pathways or genetic regulatory information in the form of figure captions, the extraction of miRNA from both the main content and figure captions of a manuscript is useful in aggregate analysis and comparative analysis of the studies published. In this study, we present a statistical principle-based miRNA recognition and normalization method to identify miRNAs and link them to the identifiers in the Rfam database. As one of the core components in the text mining pipeline of the database miRTarBase, the proposed method combines the advantages of previous works relying on pattern, dictionary and supervised learning and provides an integrated solution for the problem of miRNA identification. Furthermore, the knowledge learned from the training data is organized in a human-interpretable manner to explain why the system considers a span of text to be a miRNA mention, and the represented knowledge can be further complemented by domain experts. We studied the ambiguity level of miRNA nomenclature to connect the miRNA mentions to the Rfam database and evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus, extending the latter corpus with additional Rfam normalization information. Our study highlights, and offers a better understanding of, the challenges associated with miRNA identification and normalization in scientific literature and the research gap that needs to be further explored in prospective studies.


Subjects
MicroRNAs/metabolism , Publications , Statistics as Topic , Algorithms , Databases, Genetic , Internet , MicroRNAs/genetics , Molecular Sequence Annotation
16.
J Cheminform ; 10(1): 64, 2018 Dec 17.
Article in English | MEDLINE | ID: mdl-30560325

ABSTRACT

The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system is based on two different NER approaches: the statistical-principle-based approach (SPBA) and conditional random fields (CRF). Therefore, we call our system SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the evaluation metric of GPRO type 1 and an F-score of 78.66% on the evaluation metric of combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve performance of the CRF-based recognizer trained on the GPRO dataset.

17.
Nat Commun ; 9(1): 5202, 2018 12 06.
Article in English | MEDLINE | ID: mdl-30523255

ABSTRACT

The programmable one-pot oligosaccharide synthesis method was designed to enable the rapid synthesis of a large number of oligosaccharides, using the software Optimer to search Building BLocks (BBLs) with defined relative reactivity values (RRVs) to be used sequentially in the one-pot reaction. However, there were only about 50 BBLs with measured RRVs in the original library and the method could only synthesize small oligosaccharides due to the RRV ordering requirement. Here, we increase the library to include 154 validated BBLs and more than 50,000 virtual BBLs with predicted RRVs by machine learning. We also develop the software Auto-CHO to accommodate more data handling and support hierarchical one-pot synthesis using fragments as BBLs generated by the one-pot synthesis. This advanced programmable one-pot method provides potential synthetic solutions for complex glycans with four successful examples demonstrated in this work.


Subjects
Computational Biology/methods , Oligosaccharides/chemistry , Oligosaccharides/chemical synthesis , Software , Carbohydrate Sequence , Models, Chemical , Molecular Structure , Polysaccharides/chemical synthesis , Polysaccharides/chemistry , Reproducibility of Results
18.
Database (Oxford) ; 2018, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30346607

ABSTRACT

Identifying the interactions between chemical compounds and genes from the biomedical literature is one of the frequently discussed topics of text mining in the life science field. In this paper, we describe Linguistic Pattern-Aware Dependency Tree Kernel, a linguistic interaction pattern learning method developed for the CHEMPROT task of BioCreative VI, to capture chemical-protein interaction (CPI) patterns within the biomedical literature. We also introduce a framework to integrate these linguistic patterns with a smooth partial tree kernel to extract the CPIs. This new method of feature representation models aspects of linguistic probability in geometric representation, which not only optimizes the sufficiency of feature dimension for classification, but also defines features as interpretable contexts rather than long vectors of numbers. In order to test the robustness and efficiency of our system in identifying different kinds of biological interactions, we evaluated our framework on three separate data sets, i.e. the CHEMPROT corpus, the Chemical-Disease Relation corpus and the Protein-Protein Interaction corpus. The corresponding experimental results demonstrate that our method is effective and outperforms several compared systems for each data set.


Subjects
Algorithms , Databases, Chemical , Linguistics , Proteins/chemistry , Protein Interaction Maps , Support Vector Machine
19.
Bioinformatics ; 34(2): 190-197, 2018 Jan 15.
Article in English | MEDLINE | ID: mdl-28968831

ABSTRACT

MOTIVATION: In recent years, massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool to provide high-resolution measurement of expression and high sensitivity in detecting low-abundance transcripts. However, RNA-seq data requires a huge amount of computational effort. The most fundamental and critical step is to align each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Though these aligners can handle spliced alignment and detect splice junctions, some challenges remain to be solved. With the advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are in high demand. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches for sequence alignment. However, the extension step is much more time consuming than the seeding step. RESULTS: We proposed a novel RNA-seq de novo mapping algorithm, called DART, which adopts a partitioning strategy to avoid the extension step. The experimental results on synthetic datasets and real NGS datasets showed that DART is a highly efficient aligner that yields the highest or comparable sensitivity and accuracy compared to most state-of-the-art aligners, and more importantly, it spends the least amount of time among the selected aligners. AVAILABILITY AND IMPLEMENTATION: https://github.com/hsinnan75/DART. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

20.
Nucleic Acids Res ; 46(D1): D296-D302, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126174

ABSTRACT

MicroRNAs (miRNAs) are small non-coding RNAs of ∼22 nucleotides that are involved in negative regulation of mRNA at the post-transcriptional level. Previously, we developed miRTarBase, which provides information about experimentally validated miRNA-target interactions (MTIs). Here, we describe an updated database containing 422 517 curated MTIs from 4076 miRNAs and 23 054 target genes collected from over 8500 articles. The number of MTIs curated by strong evidence has increased ∼1.4-fold since the last update in 2016. In this updated version, target sites validated by reporter assay that are available in the literature can be downloaded. The target site sequences can be used to extract new features for analysis via a machine learning approach, which can help to evaluate the performance of miRNA-target prediction tools. Furthermore, different ways of browsing make it easier for users to find specific MTIs. With these improvements, miRTarBase serves as a more comprehensively annotated, experimentally validated miRNA-target interaction database in the field of miRNA-related research. miRTarBase is available at http://miRTarBase.mbc.nctu.edu.tw/.


Subjects
Databases, Genetic , MicroRNAs/metabolism , RNA, Messenger/metabolism , Data Mining , Humans , RNA, Messenger/chemistry , User-Computer Interface